Comparing corpus-based counts versus web page counts as estimates of lexical frequency
Abstract
The frequency of occurrence of words plays an important role in human visual and auditory word recognition processes. Psychologists usually employ lexical frequency estimates based on corpora consisting of books and journal articles. Using web pages as a sample is an alternative and attractive option (Blair, 2002). Several Internet search engines report the number of pages containing a given target word (this number is called the ‘web page hits’). Thus, knowing the total number of pages indexed by the engine, one can compute the proportion of web pages containing a given item. It seems to us that the Internet might reflect current word usage better than the types of texts (books and journal articles) included in the corpora used by psycholinguists.

It should be clear, however, that the page hit rate is not a direct estimate of an item’s lexical frequency. First, repetitions within the same page are not taken into account. Second, consider the English item ‘the’, which probably appears in most English web pages and therefore has a percentage of hits nearing 100%; yet its frequency of occurrence in the language is obviously much lower than 100%. For very high-frequency items, therefore, one can expect the percentage of web page hits to be larger than the frequency of occurrence in texts. Conversely, the occurrences of very low-frequency items probably cluster within the same few web pages, so their web page hits may be rather small and may underestimate their actual frequency. It would be helpful if Internet search engines returned the number of occurrences of a given target word across all indexed pages, but they do not. Nevertheless, intuitively, when comparing two items, the ratio of their page hits may be a good approximation of the ratio of their frequencies. This is what we assessed using the Lexique database (cf. http://www.lexique.org). The ‘Graphemes’ table lists about 129,000 word forms and provides, for each of them:
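To make the proposed ratio comparison concrete, here is a minimal Python sketch that computes page-hit rates and contrasts a page-hit ratio with a corpus frequency ratio for two items. All numbers (hit counts, corpus frequencies, index size) and the example words are hypothetical; actual values would come from a search engine's reported hits and from a frequency database such as Lexique.

```python
# A minimal sketch of the ratio comparison described above. All hit
# counts, corpus frequencies, example words, and the index size are
# invented for illustration.

TOTAL_INDEXED_PAGES = 4_000_000_000  # assumed size of the engine's index

def hit_rate(page_hits: int, total_pages: int = TOTAL_INDEXED_PAGES) -> float:
    """Proportion of indexed pages containing the item."""
    return page_hits / total_pages

# Hypothetical page-hit counts for a frequent and a rare item.
hits = {"table": 150_000_000, "tablecloth": 900_000}

# Hypothetical corpus frequencies (occurrences per million words).
corpus_freq = {"table": 350.0, "tablecloth": 2.5}

hit_ratio = hits["table"] / hits["tablecloth"]
freq_ratio = corpus_freq["table"] / corpus_freq["tablecloth"]

print(f"hit rate of 'table':      {hit_rate(hits['table']):.4%}")
print(f"hit rate of 'tablecloth': {hit_rate(hits['tablecloth']):.4%}")
print(f"page-hit ratio:          {hit_ratio:.1f}")
print(f"corpus frequency ratio:  {freq_ratio:.1f}")

# If page hits track lexical frequency, the two ratios should be of the
# same order of magnitude. For very high-frequency items (e.g. 'the')
# the hit rate saturates near 100% and the comparison breaks down.
```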
Similar resources
A Study of Using Search Engine Page Hits as a Proxy for n-gram Frequencies
The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using Web search engine page hit counts as estimates for n-gram frequencies. While the results so far have been very encouraging, some researchers worry about what appears to be the instability of these estimates. Using a particular NLP task, we compare the variability in the n-g...
Data Mining at the Intersection of Psychology and Linguistics
Large data resources play an increasingly important role in both linguistics and psycholinguistics. The first data resources used by both psychologists and linguists alike were word frequency lists such as Thorndike and Lorge (1944) and Kučera and Francis (1967). Although the Brown corpus on which the frequency counts of Kučera and Francis were based was very large for its time, comprising some...
On the Instability of Using Search Engine Page Hits as a Proxy for n-gram Frequencies
The idea of using the Web as a corpus for linguistic research is getting increasingly popular. Most often this means using page hit counts as an estimate for n-gram frequencies. While the results so far have been very encouraging, there are also some problems, the most important of which is the instability of these estimates. Using a particular NLP task, we find substantial variability in the n...
A Web Search Engine-based Approach to Measure Semantic Similarity between Words
Measuring the semantic similarity between words is an important component in various tasks on the web such as relation extraction, community mining, document clustering, and automatic metadata extraction. Despite the usefulness of semantic similarity measures in these applications, accurately measuring semantic similarity between two words (or entities) remains a challenging task. We propose an...